<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
	<TITLE>JGrid 0.2: RMI-based Java interface for SGE</TITLE>
	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
	<META NAME="CREATED" CONTENT="20030131;9290500">
	<META NAME="CHANGED" CONTENT="20030131;10055300">
	<STYLE>
	<!--
		H2 { font-family: "Sunsans Demi" }
		H3 { font-family: "Sunsans Demi" }
	-->
	</STYLE>
</HEAD>
<BODY LANG="en-US">
<H1>JGrid 0.2.1: an RMI-based Java interface for Grid Engine</H1>
<H2>Overview</H2>
<P>This document will attempt to explain what the JGrid package is
and does, how to install it, how to configure it and how to develop
with it. This software is still in an early state of development. The
intention is to provide a transactional Java interface to the Sun
Grid Engine product. This is accomplished by intercepting RMI
communications on the wire and feeding them through the grid engine.
The net result is that a Java client can make use of the grid without
knowing anything about the grid. More importantly, because the JGrid
package intercept live RMI calls, the Java client is also to pass
live objects to the grid and receive live objects in response. What's
more, because of how the serialized RMI call is deserialized, there's
no JVM startup overhead; the Java process which executes the job on
the execution hosts is always running.</P>
<H2>Contents</H2>
<OL>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Introduction|outline">Introduction</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Installation|outline">Installation</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Configuration|outline">Configuration</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Testing|outline">Testing</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Development|outline">Development</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Futures|outline">Futures</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#Disclaimer|outline">Disclaimer</A></P>
</OL>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
<P>The JGrid package is now available via CVS.  You can browser the source
<A HREF="http://gridengine.sunsource.net/source/browse/gridengine/source/experimental/jgrid/">here</A>.</P>
<HR>
<H2><A NAME="Introduction|outline"></A>Introduction</H2>
<P>To begin with, this document will explain the concepts behind how
JGrid works.</P>
<H3>The Architecture</H3>
<P>The JGrid package has three major components, the server, the
native peer, and the agent.</P>
<H4>The Server</H4>
<P>The Server is a Java server which uses smart RMI proxies to 
intercept RMI calls and feed them through the grid.  It uses two RMI
registries: one for the interface to the client, and one for the interface
to the agent.
<H4>The Native Peer</H4>
<P>The job the server submits to the grid is to run the native peer on
the execution host. The native peer is a native binary that connects to
the agent and requests the Java job be run.  The native peer remains running
as long as the Java job runs so that SGE has a handle to use for dealing with
the Java job.</P>
<H4>The Agent</H4>
<P>The agent is a Java server that runs on each execution
host. The native peer contacts the agent via a custom protocol
(JCEP: Java Compute Engine Protocol) and sends the
Java job request. The agent then executes the
requested job and returns the results to the server via RMI. The server can
then return those results to the client.</P>
<H4>Advantages</H4>
<P>This architecture has several advantages. The first and foremost
is that it allows a Java client to interact with the grid via a well
known Java programming model, i.e. RMI, in a transactional context.
Also, because RMI allows for code mobility the code to be executed on
the client's behalf does not have to be known the server. As long as
the class files are posted in an accessible location, the server can
dynamically retrieve the binaries need to execute the client's
request. The server allows clients to submit both synchronous and
asynchronous jobs. The native peer saves the time and memory
overhead associated with starting a new VM for every request. Because
the compute client is always running, each execution host only has to
start the VM once.</P>
<HR>
<H2><A NAME="Installation|outline"></A>Installation</H2>
<P>Before you begin, J2SE 1.4 must be installed on all the systems in
the grid. First install Sun Grid Engine. Next, make sure each host
has access to the JGrid.jar file, either through a shared filesystem
or through individual copies. Next choose a host to be the proxy
server. The proxy host is the only host through which Java clients
will be able to submit jobs. On the proxy host run the server by
typing: 
</P>
<PRE STYLE="margin-bottom: 0.5cm">java -cp JGrid.jar com.sun.grid.jgrid.proxy.ComputeServer</PRE><P>
Next, make sure each execution host has access to the native peer
(found in proxy.jar under com/sun/grid/skel/JCEPskeleton) and the skel.sh
script. On each execution host run the agent by typing:</P>
<PRE>java -cp JGrid.jar com.sun.grid.jgrid.server.JCEPHandler -reghost hostname</PRE><P>
This, of course, is not likely to work out of the box. In order for JGrid to
function, read the next section on configuring the system.</P>
<P>Note that the agents must be started <B>after</B> the
server.</P>
<HR>
<H2><A NAME="Configuration|outline"></A>Configuration</H2>
<H3>Configuring the Server</H3>
<P>The server has a variety of command line switches to control its
behavior. The most useful are: -sub, -skel, and -d.</P>
<H4>-sub</H4>
<P>The -sub switch allows the command used to submit a job to the
queue to be set. In not used, the submission command defaults to
&quot;qsub -notify&quot;. If path information needs to be included or another
command must be used, use -sub. For example:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.proxy.ComputeServer -sub &quot;/sge/bin/solaris64/qsub -notify&quot;</PRE>
<H4>-skel</H4>
<P>The -skel switch allows the command to be executed on the
execution host to be set. If not used, the command defaults to
&quot;skel.sh&quot;. If path information needs to be included or another
command must be used, use -skel. For example:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.proxy.ComputeServer -skel /sge/jgrid/skel.sh</PRE>
<H4>-d</H4>
<P>The -d switch allows the job file directory to be set. If not
used, the job directory defaults to &quot;ser&quot;. If another
directory is desired, use -d. For example:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.proxy.ComputeServer -d /sge/jgrid/jobs</PRE>
<H4>Other Switches</H4>
<P>For more information about command lines switches for the server type:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.proxy.ComputeServer -help</PRE>
<H3>Configuring the Native Peer</H3>
<P>The native peer binary cannot be run directly by SGE, so a script,
skel.sh, has been included which starts up the skeleton. To change
the job file directory used by the skeleton, edit the script and
change the WORK_DIR variable.</P>
<P>The native peer can also be used independent of SGE for testing purposes
mostly.  For more information, run JCEPskeleton (or skel.sh) with no parameters.</P>
<H3>Configuring the Agent</H3>
<P>The agent has a variety of command line switches to control its
behavior. The most useful are: -reghost and -grace.</P>
<H4>-reghost</H4>
<P>The -reghost switch tells the agent on which host to find the RMI registry
containing the result channel.  For example:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.server.JCEPHandler -reghost master.germany.sun.com</PRE>
<H4>-grace</H4>
<P>Since JGrid allows the submission of arbitrary jobs, it must be assumed that
not all jobs will be well behaved.  Since Java does not provide (anymore) a
facility for killing an executing thread without killing the whole VM, canceling
a job consists of asking the job nicely to stop executing.  If it refuses,
there's nothing that can be done about it short of stopping the VM. 
Additionally, the Job class is written such that a call to cancel() only
&quot;guarantees&quot; that the job will stop &quot;as soon as possible&quot;.
Unfortunately, there really is no way to tell a job that is slow to stop from a
job that has gone rogue and will never stop.  To deal with this problem, the
JCEPHandler uses a grace period when shutting down.  The JCEPHandler will ask
each running job to stop and then wait for a set amount of time before
explicitly calling System.exit() to stop the VM.  The amount of time to wait is
set with the -grace switch.  For example, to set the grace period to 30 seconds
one would use:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.server.JCEPHandler -grace 30</PRE>
<H4>Other Switches</H4>
<P>For more information about command lines switches for the agent type:</P>
<PRE STYLE="margin-bottom: 0.5cm">java com.sun.grid.jgrid.server.JCEPHandler -help</PRE>
<H3>Configuring the VM</H3>
<P>The server requires access to several
directories and much of the network. To avoid copious security
errors, a modified security policy file must be specified when
running the server. An appropriately modified security file is
included in the tar file. If the SGE install and the job file
directory are somewhere other than in /sge, the policy file will need
to be modified. To set the security file, type: 
</P>
<PRE STYLE="margin-bottom: 0.5cm">java -Djava.security.policy=/sge/java.policy com.sun.grid.jgrid.proxy.ComputeServer</PRE>
<HR>
<H2><A NAME="Testing|outline"></A>Testing</H2>
<P>To test if the system is working, do the following:</P>
<P>On the proxy host type:</P>
<PRE>java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar:client.jar \
     com.sun.grid.jgrid.proxy.ComputeServer -sub /sge/bin/solaris64/qsub \
     -skel &quot;/sge/jgrid/skel.sh submit&quot; -d /sge/jgrid/jobs</PRE><P>
Replace the path information with information that is correct for
grid. When the server indicates it's ready, on the execution host
type:</P>
<PRE>java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar:client.jar \
     com.sun.grid.jgrid.server.JCEPHandler -reghost proxy_host</PRE><P>
Replace proxy_host with the name of the server on which the server is
running. When the server indicates it's ready, on a client type:</P>
<PRE>java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar:client.jar \
     dant.test.ComputeClient proxy_host 1099</PRE><P>
Replace proxy_host with the name of the server on which the server is
running. If everything is working, the
execution host should print &quot;This is a test&quot; in the job output file
(skel.sh.o[0-9]+) along with some lifecycle messages and the client
should indicate that it received the value &quot;TEST&quot; from the
server. If everything isn't working, try rerunning the above commands with the
-debug switch to see copious amounts of debug output that will hopefully tell
you something useful.  You may have to kill the server and agent (CTRL-C) to get them to print their output.</P>
<P>To test if the code mobility is working, do the following:</P>
<P>Make the client.jar available via http. On the proxy host type:</P>
<PRE>java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar \
     com.sun.grid.jgrid.proxy.ComputeServer -sub /sge/bin/solaris64/qsub \
     -skel &quot;/sge/jgrid/skel.sh submit&quot; -d /sge/jgrid/jobs</PRE><P>
Replace the path information with information that is correct for
grid. When the server indicates it's ready, on the execution host
type:</P>
<PRE>java -Djava.security.policy=/sge/jgrid/java.policy -cp JGrid.jar
     com.sun.grid.jgrid.server.JCEPHandler -reghost proxy_host</PRE><P>
Replace proxy_host with the name of the server on which the server in
running. When the server indicates it's ready, on a client type:</P>
<PRE>java -Djava.security.policy=/sge/jgrid/java.policy -Djava.rmi.server.codebase=url \
     -cp JGrid.jar:client.jar dant.test.ComputeClient proxy_host 1099</PRE><P>
Replace proxy_host with the name of the server on which the server in
running. Replace url with the fully qualified URL where the
client.jar is stored. If everything is working,
the execution host should print &quot;This is a test&quot; and the
client should indicate that it received the value &quot;TEST&quot;
from the server. The difference between this test and the previous
test is that in the previous test the server and compute agents had
client.jar included in their classpaths. In this test, instead of
including client.jar in the classpath, client.jar was made available
via http and exported by the client via the codebase property. The
ComputeTest object that the compute engine executed came from code
that was moved dynamically over the network.</P>
<P>Note that whenever you explicitly set the submission command with -sub, if
you want your jobs to be suspendable and cancelable, you have to give the
command as &quot;/path/qsub -notify&quot;.  This is required because the native
peer needs time to relay the command to the job before itself being suspended
or killed.  Similarly, whenever setting the native peer via the -skel switch,
you must always give it a parameter, normally &quot;submit&quot;.</P>
<HR>
<H2><A NAME="Development|outline"></A>Development</H2>
<P>Developing with the JGrid is easy. A developer only needs to know
two interfaces: com.sun.grid.jgrid.Computable and
com.sun.grid.jgrid.ComputeEngine. A complete API doc is included in the
tar.</P>
<P>The Computable interface must be implemented by any object passed
to JGrid for execution. Computable inherits from Serializable, so any
Computable class must also be Serializable. The Computable interface
has five methods: compute(), checkpoint(), cancel(), suspend(), and resume
().</P>
<P>The compute() method is called by the compute engine when the
job is assigned to an execution host. If a problem occurs during
execution, the Computable class can throw a ComputeException.  If everything
goes to plan, the compute() method returns a Serializable results object.</P>
<P>The cancel() method is used to stop the job.  The job does not have to be
stopped when the cancel() method returns.  It only has to be in the process of
stopping.  This is currently to prevent the job from having to implement
complicated synchronizations.  This will change soon.</P>
<P>The checkpoint() method forces the job to make sure it's serializable
attributes are up to date.  The assumption is that a job that is checkpointable
will do it's work in transient variables and store it's state in non-transient
variables, preferably only after a checkpoint() call.  In this way, the job's
state can be keep consistent.  The downside is that any work that gets done
between the serialization of the job and the cancelation of the job gets
lost.</P>
<P>The suspend() and resume() methods are used to suspend and resume jobs,
respectively.  The suspend() method, like cancel(), is not required to leave
the job in a suspended state, only in a suspending state.  Like with cancel(),
this will also change soon.<P>
<P>cancel(), suspend(), and checkpoint() can all throw a NotInterruptableException
is the job does not implement that facility.  A job that cannot be canceled
can only be killed by stopping the VM.</P>
<P>The ComputeEngine interface is implemented by the
server. The ComputeEngine interface
allows clients to submit jobs synchronously or asynchronously. To
submit a job synchronously, the client calls the compute() method,
passing it a Computable object. The method returns the Serializable
results object. To submit a job asynchronously, the client calls the
computeAsynch() method, passing a Computable object. The method
returns the job id. The client can then call isComplete(), passing
the job id, to find out if the job has completed. The client can also
call getResults(), passing the job id, or get
the Serializable results object if the job has completed.</P>
<P>To get a reference to the ComputeEngine, the client must connect
to the RMI registry on the proxy host and do a lookup for
&quot;ComputeEngine&quot;.</P>
<P>See the API docs for more info.</P>
<HR>
<H2><A NAME="Futures|outline"></A>Futures</H2>
<P>I finally have a pretty clear idea of where I'm going with JGrid.  The
following are the major things on my to-do list, in order of execution:</P>
<UL>
<LI>DONE -- Replace handwritten JRMP code in ComputeProxy with custom RMI classes</LI>
<LI>Replace spooling jobs to disk with spooling jobs to a JavaSpace</LI>
<LI>Replace ResultChannel with an update to the spooled job in the
JavaSpace</LI>
<LI>Make the agents themselves be SGE jobs for better control and visibility</LI>
<LI>Add user authentication and authorization</LI>
<LI>Replace fork & exec of qsub, et al with DRMAA calls</LI>
<LI>Provide an interface for controling job parameters and agents from the
client</LI>
</UL>
<HR>
<H2><A NAME="disclaimer"></A><A NAME="Disclaimer|outline"></A>Disclaimer</H2>
<P>JGrid is a product of my creation. Sun in no way supports,
acknowledges or lends legitimacy to JGrid. I take no
responsibility for any losses resulting from the use of Jgrid. Please
direct any questions or comments to me at:</P>
<P>Daniel Templeton<BR><A
HREF="mailto:dan.templeton@sun.com">dan.templeton@sun.com</A><BR>
+49 941/3075 220<BR>
Dr-Leo-Ritter-Strasse 7, D-93049 Regensburg</P>
</P>
</BODY>
</HTML>

